Detecting potable water

It is the year 2050 and the world's reserves of potable water are becoming increasingly scarce, so new sources must be found. With a dataset containing 9 measurements from more than 3000 different reservoirs, you are required, as chief data scientist of H2cO2, to flag those reservoirs that are likely to contain potable water. H2cO2 operatives cannot afford to check all reservoirs on-site, so they need your help to tell them which reservoirs they should inspect more closely.

So the task is to detect the potable water reservoirs (positive cases). It is very important to find more potable water for the population, so missing a positive case is much worse than classifying a negative case as positive. In other words, we want to favor recall over precision.

For this reason, we will be using the $F2$ measure, which assigns more weight to recall.
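The general form is $F_\beta = (1+\beta^2)\frac{P \cdot R}{\beta^2 P + R}$, so with $\beta=2$ recall counts four times as much as precision in the denominator. In scikit-learn this is `fbeta_score` with `beta=2`; a minimal sketch on made-up labels:

```python
from sklearn.metrics import fbeta_score

# Toy labels: 4 positives, 2 negatives (invented for illustration)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# precision = 2/3, recall = 2/4; F2 weights recall more heavily
f2 = fbeta_score(y_true, y_pred, beta=2)  # (1+4)*P*R / (4*P + R) = 10/19
```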

Then the CEO of H2cO2 tells you:

The future of humanity rests on your shoulders

BTW, this data we are giving you, it is actually invented you know, it's just to see, so don't expect great results. Like, if you plot it on a 2d space, you will not be able to differentiate the classes lol. And we have even put some missing values in there so that you can have more fun with it.

Data source

Import Libraries

Read data

Obtain overview report of the data

There are 9 explanatory variables, all of them numeric, and one target variable (Potability). All explanatory variables follow an approximate Gaussian distribution, as per the plots displayed above.

The target classes are imbalanced: 61% belong to nonpotable (0) and 39% to potable (1).

There seems to be little collinearity among variables. The highest correlations occur between Solids, Chloramines and Sulfate, in the 15%-30% range. Given this, feature selection is not indicated, and dimensionality reduction does not seem necessary either.

Overall, 4.4% of the fields contain a missing value. The columns which present missing values are Sulfate (23.8%), ph (15.0%) and Trihalomethanes (4.9%).
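These percentages come straight from `isna()`; a sketch on a toy frame standing in for the real dataset (the values below are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the three columns with missing values
df = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [np.nan, 330.0, np.nan, 310.0],
    "Trihalomethanes": [66.0, 70.0, np.nan, 60.0],
})

# Fraction of missing cells over the whole frame, then per column
overall_pct = df.isna().mean().mean() * 100
per_column_pct = (df.isna().mean() * 100).sort_values(ascending=False)
```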

Dealing with missing values

More than a third of the rows contain at least one missing value. We cannot simply remove these rows, as that would imply a huge loss of information.

Moreover, if we take the target into account, it seems the values are not missing completely at random: the proportion of missings for potable water is notably lower than for nonpotable water. Without domain knowledge it is hard to know why. Could it be that measurements are harder to take when the water is nonpotable?

What's more, the missingness count in the cell above considers three variables at once and conditions on the class; that is, it estimates the probability that an observation contains at least one missing value: $P(ph_{missing} \vee Sulfate_{missing} \vee Trihalomethanes_{missing}|Potability)$.
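That conditional probability can be estimated with a `groupby` over a row-wise "any missing" flag; a sketch on an invented toy frame:

```python
import numpy as np
import pandas as pd

# Invented rows: three per class, with NaNs scattered across the three columns
df = pd.DataFrame({
    "ph": [np.nan, 7.0, 6.8, np.nan, 7.2, 6.9],
    "Sulfate": [300.0, np.nan, 320.0, 310.0, np.nan, 333.0],
    "Trihalomethanes": [60.0, 66.0, np.nan, 70.0, 72.0, 64.0],
    "Potability": [0, 0, 0, 1, 1, 1],
})
cols = ["ph", "Sulfate", "Trihalomethanes"]

# P(at least one of the three is missing | Potability)
p_any_missing = df[cols].isna().any(axis=1).groupby(df["Potability"]).mean()
```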

We can therefore say that the overall amount of missings is dependent on the class. However, this analysis does not tell us that the class influences each of these three variables in particular, at the univariate level.

Let's look closer at missings at the univariate and bivariate level:

Let's build a dummy dataframe by making dummy variables out of the three columns containing NaNs, setting each value to 1 where there is a NaN and to 0 otherwise.
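This is a one-liner with `isna()`; a sketch (toy data, and the `_missing` suffix is just an illustrative naming choice):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three columns with NaNs
df = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5],
    "Sulfate": [np.nan, 330.0, 310.0],
    "Trihalomethanes": [66.0, np.nan, np.nan],
})

# 1 where the value is missing, 0 otherwise
na_dummies = df.isna().astype(int).add_suffix("_missing")
```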

And now let's see if there is correlation between the missingness of these three variables and the other variables:
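One way to check this is to correlate each missingness indicator against the fully observed columns with `corrwith`; a sketch on synthetic data (column names borrowed from the dataset, values invented, missings injected at random so the correlations should come out near zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Hardness": rng.normal(196, 33, n),
    "Conductivity": rng.normal(426, 81, n),
    "ph": rng.normal(7, 1.5, n),
})
df.loc[rng.random(n) < 0.15, "ph"] = np.nan  # inject missings at random

# Correlate the ph missingness indicator with the complete columns;
# near-zero values mean the observed covariates do not explain the missingness
na_dummy = df["ph"].isna().astype(int)
corrs = df[["Hardness", "Conductivity"]].corrwith(na_dummy)
```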

From this bivariate analysis, it seems that the variables without missings cannot predict whether a value is missing in any of the three affected columns, so the missing-at-random assumption does not seem reasonable.

Therefore, we will assume that values are missing completely at random given the class. Note, however, that the target is unknown at prediction time, so we cannot use it to perform the imputation. For this reason, we will stick to the simpler assumption that missing values are missing completely at random.


Since there is a considerable amount of missings, imputing with the median would flatten the regression curve, which can lead to higher bias, so we discard the median. For the same reason, imputing with the average would considerably diminish the variance of the samples, which can lead to overfitting. To limit both bias and overfitting in our models, we will impute a value derived from the k nearest neighbours of a sample, using the average weighted by the inverse of the distance. Since kNN works with distances, and we do not want to favor variables whose dynamic range is smaller than that of other variables, we need to scale the data beforehand. For the imputation, let's use $k=5$, although this choice is somewhat arbitrary.
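A sketch of this step, assuming scikit-learn's `KNNImputer` chained after `StandardScaler` (synthetic data stands in for the real frame; `StandardScaler` disregards NaNs when fitting, so the two can be chained directly):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with ~20% missings injected in the first column
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[rng.random(100) < 0.2, 0] = np.nan

# Scale first so no variable dominates the distance, then impute with
# the distance-weighted average of the 5 nearest neighbours
imputer = Pipeline([
    ("scale", StandardScaler()),
    ("impute", KNNImputer(n_neighbors=5, weights="distance")),
])
X_imputed = imputer.fit_transform(X)
```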

Split data into training and testing sets
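Since the classes are imbalanced, the split should be stratified so both sets keep the 61/39 ratio; a sketch on synthetic data (the split proportion is an assumption, not necessarily the one used here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 9 features, imbalanced binary target (~61/39)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))
y = (rng.random(1000) < 0.39).astype(int)

# stratify=y keeps the class imbalance identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```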

Let's visualize the data using MDS to get a feel for its characteristics. This takes a while to compute (about 5 minutes).
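A sketch of the embedding step with scikit-learn's `MDS` (synthetic data here; MDS is roughly quadratic in the number of samples, which is why it is slow on the full training set, and subsampling is a common workaround):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; on real data, colour the points by class when plotting
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))

# Scale, then project to 2D for visual inspection of class separability
embedding = MDS(n_components=2, random_state=0).fit_transform(
    StandardScaler().fit_transform(X)
)
```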

There are some outliers for both classes.

The classes overlap heavily, so this is going to be a difficult classification problem.

Hyperparameter Optimization and Model Comparison

Let's compare models on the training set. To make the comparison more robust, we will tune the models inside the CV procedure, using nested cross-validation as described here.
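The nested scheme can be sketched by wrapping a `GridSearchCV` (inner loop, tuning) inside `cross_val_score` (outer loop, evaluation); the model and grid below are illustrative, not the full set compared here:

```python
import numpy as np
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the training set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = (rng.random(200) < 0.39).astype(int)

f2_scorer = make_scorer(fbeta_score, beta=2)

# Inner CV tunes hyperparameters; outer CV estimates generalization,
# so the tuning never sees the outer test folds
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, scoring=f2_scorer, cv=3)
outer_scores = cross_val_score(inner, X, y, scoring=f2_scorer, cv=5)
```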

We will also try reducing dimensionality prior to the classifier by using PCA, and we will also check if upsampling the minority class or undersampling the majority class helps. More precisely, we will try:
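As an illustration of the resampling idea (not the full list of combinations tried), upsampling the minority class can be sketched with `sklearn.utils.resample`; note that in the actual CV procedure the resampling must happen inside each training fold to avoid leakage, for which `imbalanced-learn`'s pipeline is the idiomatic tool:

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced data mirroring the 61/39 split
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))
y = np.array([0] * 61 + [1] * 39)

# Resample the minority (potable) class with replacement up to majority size
X_maj, X_min = X[y == 0], X[y == 1]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_maj))
```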

After comparing the models, we will fine tune each one separately on the whole training set, and make them predict on the testing set.

Since our main goal is to find new sources of potable water, we want to avoid false negatives (potable water classified as nonpotable), so it is appropriate to use the $F2$ measure, which favors recall over precision.

Now visualize the results with a boxplot.

It seems that dimensionality reduction with PCA is not bringing any benefits in terms of F2, so let's skip it altogether. However both upsampling and downsampling are driving F2 up. The best approach seems to be the following:

Let's focus on these promising models, fine-tune their hyperparameters on the whole training set and then have them predict on the testing set.
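A sketch of this final step for one model (the grid is illustrative; `GridSearchCV` refits the best configuration on the whole training set before predicting):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the imputed dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
y = (rng.random(300) < 0.39).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

f2_scorer = make_scorer(fbeta_score, beta=2)

# Tune on the whole training set, optimizing F2
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring=f2_scorer,
    cv=5,
).fit(X_tr, y_tr)

# Predict on the held-out testing set with the refitted best model
y_pred = search.best_estimator_.predict(X_te)
test_f2 = fbeta_score(y_te, y_pred, beta=2)
```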

The results are:

We can see that both RF and XGB obtain higher F2 and recall. Note, however, that these two models actually have less discriminatory power than SVM and MLP, as shown by their lower AUC.

If we put RF or XGB into production, fewer potable water sources will be missed by H2cO2, but more nonpotable sources will be inspected due to false positives, leading to higher labor and time costs for H2cO2. If we choose SVM or MLP, on the other hand, more potable water sources will be missed, but fewer resources will be wasted on exploring water sources that are nonpotable but flagged as potable by the model.

All in all, there is no perfect model here, so which one we choose will depend on business considerations.